A multiple criteria decision analysis based approach to remove uncertainty in SMP models

Software has to be updated frequently to match customer needs. If software maintainability is not given priority, the software development life cycle and maintenance expenses suffer, which depletes organizational assets. Maintainability must be estimated before software is released, because bugs and errors found after deployment can affect the cost and reputation of the organization. Regardless of the programming paradigm, it is important to assess software maintainability. Many software maintainability prediction (SMP) models are criticized for their compatibility with new programming paradigms because of their limited applicability over heterogeneous datasets. Due to this challenge, small and medium-sized organizations may even skip the maintainability assessment, resulting in huge losses to such organizations. Motivated by this fact, we used a Genetic Algorithm optimized Random Forest technique (GA) for software maintainability prediction over heterogeneous datasets. To find the optimal model for software maintainability prediction, the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS), a popular multiple-criteria decision-making model, is adopted. From the results, it is concluded that GA is optimal for predicting the maintainability of software developed in various paradigms.


Software maintainability measures and prediction approaches
Software maintainability is hard to measure directly, so previous researchers have relied on predictive models. The measured maintainability depends on the metrics used, and identifying the appropriate source code metrics is the most difficult aspect of maintainability prediction. Researchers have estimated software maintainability using the Maintainability Index (MI), Change, and other measures 7. The MI is the most widely used among researchers because it has been validated by Hewlett-Packard (HP).
The MI in Eq. (1) was proposed by Coleman et al. 8. It combines several metrics, including Halstead's Volume (HV), McCabe's cyclomatic complexity (CC), lines of code (LOC), and percentage of comments (COM), and has been validated by HP 3.
In 1997, the Software Engineering Institute (SEI) derived the MI in Eq. (2), which is based on the work of Coleman et al. The MI ranges from 0 to 100, where 0 indicates that the software is hard to maintain, and values closer to 100 indicate that the software is more maintainable 9.
Radon 10, a popular Python tool, uses another derivative of the MI, depicted in Eq. (3). It is calculated based on the SEI and Visual Studio derivatives.
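The MI equations themselves are not reproduced in this text. As a hedged illustration only, the sketch below implements the MI variants in the forms in which they are commonly documented (for example, in the Radon documentation). The coefficients, the function names, and the conversion of the comment percentage to radians are assumptions and are not taken from Eqs. (1)-(3) of this paper.

```python
import math


def mi_original(volume, complexity, loc):
    """Three-metric MI: 171 - 5.2 ln(V) - 0.23 G - 16.2 ln(LOC) (assumed form)."""
    return 171 - 5.2 * math.log(volume) - 0.23 * complexity - 16.2 * math.log(loc)


def comment_term(comment_pct):
    """Comment bonus 50 sin(sqrt(2.4 C)); C is converted to radians (assumed convention)."""
    return 50 * math.sin(math.sqrt(2.4 * math.radians(comment_pct)))


def mi_sei(volume, complexity, loc, comment_pct):
    """SEI-style derivative: base-2 logarithms plus the comment term (assumed form)."""
    return (171 - 5.2 * math.log2(volume) - 0.23 * complexity
            - 16.2 * math.log2(loc) + comment_term(comment_pct))


def mi_visual_studio(volume, complexity, loc):
    """Visual Studio-style derivative: rescaled to 0-100 and clamped at 0."""
    return max(0.0, 100 * mi_original(volume, complexity, loc) / 171)


def mi_radon(volume, complexity, loc, comment_pct):
    """Radon-style derivative: Visual Studio rescaling plus the SEI comment term."""
    raw = mi_original(volume, complexity, loc) + comment_term(comment_pct)
    return max(0.0, 100 * raw / 171)


if __name__ == "__main__":
    # Hypothetical module: Halstead volume 1000, cyclomatic complexity 10,
    # 100 lines of code, 20% comment lines.
    print(mi_original(1000, 10, 100))
    print(mi_sei(1000, 10, 100, 20))
    print(mi_visual_studio(1000, 10, 100))
    print(mi_radon(1000, 10, 100, 20))
```

All four functions expect a positive Halstead volume and a positive line count, since the logarithm is undefined otherwise; a production tool would need to guard against empty modules.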
In 2011, a Microsoft team blog 11 reset the thresholds to 0-9 = Red, 10-19 = Yellow, and 20-100 = Green, which indicate poor, medium, and high maintainability, respectively, with minor modifications to the formula as shown in Eq. (4).
Li and Henry found that complexity, coupling, cohesion, and inheritance metrics had a substantial link with class change volumes after examining two projects, UIMS (User Interface Management System) and QUES (Quality Evaluation System). With the aid of Classic-Ada, information on maintenance activities was collected from these two commercial systems over 3 years. The maintenance effort is determined by the number of lines per class that have been edited. According to Eq. (5), the metric Change might include adding or removing code lines. This measure can be employed to assess an object-oriented system's maintainability.
CHANGE = F(WMC, DIT, NOC, RFC, LCOM, MPC, DAC, NOM, S1, S2)   (5)
However, no commonly accepted measure for identifying the relevant code metrics or prediction models is available 12.
www.nature.com/scientificreports/
Researchers have employed individual prediction models such as SVR, ANN, and LSTM, as well as ensemble models that improve on the prediction accuracy of individual models 13. Ensemble models can be classified into two major types: homogeneous, which combine individual models of the same type, and heterogeneous, which combine individual models of different types. The data needed to analyze cross-domain projects and other benchmark datasets are scarce.

Recent works
In this section, we present a detailed overview of recent works on predicting the maintainability of software.
In 2021, Iqbal et al. 13 used a supervised learning approach to identify the changes that were required in the legacy system's current software components. New requirements and defect kinds necessitated a thorough redesign of the software components' interfaces. The software maintainability was assessed using the naive Bayes classifier, a machine learning technique. Software components designed with the inverse criteria in mind were found to be error-free and easily adaptable to client needs. The authors employed limited datasets and estimated maintainability using only one machine learning technique, which is a significant flaw in this work. This could be addressed by using heterogeneous software and recent algorithms.
In 2021, Lakra et al. 14 applied hyperparameter tuning to five regression-based ML algorithms, namely random forest, ridge regression, support vector regression, stochastic gradient descent, and Gaussian process regression, on two commercial object-oriented datasets, QUES and UIMS. The results exhibited substantial improvements over the existing base models. The work focuses primarily on fine-tuning models, does not use heterogeneous software, and does not address the uncertainty in software maintainability prediction models. It also uses only a few datasets, which is a serious drawback.
In 2020, Elmidaoui et al. 15 conducted a study on empirical evidence for the accuracy of software product maintainability prediction (SPMP) using ML techniques. The results of about 77 studies published between 2000 and 2018 were examined based on the following criteria: maintainability prediction approaches, validation methods, accuracy criteria, the overall accuracy of ML techniques, and the techniques with the best performance. In most studies, the performance of ML techniques exceeded that of non-ML techniques such as regression analysis (RA), whereas fuzzy and neuro-fuzzy (FNF) techniques outscored SVM/R, DT, and ANN. Although several techniques were claimed to be superior, no specific technique could be recognized as the best, which is a serious limitation of this work.
In 2020, Malhotra et al. 16 used nine oversampling and three under-sampling approaches on unbalanced data. A detailed comparison of fourteen ML and fourteen search-based strategies has been taken into consideration to predict the class maintainability. This work supports the use of the Safe-Level Synthetic Minority Oversampling Technique (Safe-SMOTE) in handling imbalanced data when predicting class maintainability. This work has certain limitations compared to heterogeneous techniques and recent ML techniques. This work has an ambiguity in choosing an appropriate algorithm for estimating maintainability, which is a major drawback.
In 2020, Gupta et al. 17 proposed an enhanced-RFA (Random Forest Algorithm) technique for software maintainability prediction. The suggested method combines the random forest (RF) algorithm with three widely used feature selection techniques (chi-square, RF, and linear correlation filter) and a re-sampling strategy to increase the core RF algorithm's prediction accuracy. The performance of enhanced-RFA is assessed using $R^2$ on two commercially available datasets, QUES and UIMS. The proposed approach performs much better than RFA on the specified datasets with the chi-square, RF, and linear correlation filter approaches. This work has limitations in comparison to heterogeneous techniques and does not compare itself to recent ML techniques.
In 2020, Malhotra et al. 18 implemented several ML, statistical (ST), and hybridization (HB) techniques to create prediction models for software maintainability. The important finding is that ML-based models outperform ST models in terms of overall performance. The use of HB methods for software maintainability prediction is limited. It is encouraging that this work reports the prediction performance of a few models developed using HB techniques, but it reports no conclusive results about the performance of any of these techniques and ignores metaheuristics.
From the above literature survey, it is understood that the most popular SMP techniques are statistical models and individual ML models. In SMP, the ML models performed better than the statistical models. To increase the effectiveness and accuracy of the models, researchers have employed ensemble models. Several SMP models claim superiority. However, their performance across diverse programming paradigms is questionable, making the selection of an SMP model uncertain. This necessitates an approach that minimizes the effort of selecting an SMP model for diverse programming paradigms.
The summary of the recent works is depicted in Table 1.

Predicting software maintainability
This section presents an overview of the proposed methodology for assessing the maintainability of automated software.
Experimental setup. The software maintainability prediction models are implemented in R. The aim is to reduce the prediction error and improve the robustness of the proposed model. The performance of GA is compared with that of other models, namely step-wise regression, support vector machine, NN, MARS, and CART. The datasets are obtained from Li and Henry's work and the PROMISE repository. The Quality Evaluation System (QUES) and User Interface Management System (UIMS) datasets of commercial software products were originally released by Li and Henry in 1993 in their work on object-oriented metrics that predict maintainability 19-23. The details of these datasets are described in Table 2.
Objective function. The aim of this paper is to find the software maintainability prediction model with the lowest error when applied to heterogeneous datasets. The error is estimated from the difference between the actual value, $MI_{actual}$, and the predicted value, $MI_{predicted}$, as defined in Eq. (6).
Solution encoding. W represents the optimized weight, and the proposed solution is illustrated in the accompanying figure.
Genetic algorithm. To improve prediction, this paper uses GA 26, a widely used evolutionary computing algorithm. GA imitates biological evolution, in particular gene evolution, and is inspired by Charles Darwin's theory of natural evolution. The algorithm models the natural selection mechanism, in which the fittest individuals are chosen to reproduce and generate the next generation of offspring. GA supports parallelism and is easily modified and adapted to various problems. It is easy to distribute and can search a large and diverse solution space. The fitness function is evaluated through a non-knowledge-based optimization process. GA is designed to search for the global optimum while reducing the risk of becoming trapped in a local optimum 27. For multi-objective optimization, it can return a suite of potential solutions.
GA is appropriate for large-scale and diverse optimization problems. The five rules for applying a genetic algorithm are shown in Fig. 3.
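The general mechanism described above (population, fitness, selection, crossover, mutation) can be sketched as follows. This is a minimal illustration and not the authors' GA-optimized Random Forest: it assumes a weight vector W applied to a simple weighted sum of features, uses the mean absolute difference between actual and predicted values as the fitness, and runs on hypothetical toy data. All names and parameter values are assumptions.

```python
import random


def fitness(weights, rows, targets):
    """Mean absolute difference between actual and predicted values (lower is better)."""
    total = 0.0
    for row, target in zip(rows, targets):
        predicted = sum(w * x for w, x in zip(weights, row))
        total += abs(target - predicted)
    return total / len(rows)


def evolve(rows, targets, pop_size=30, generations=60, mutation_rate=0.2, seed=0):
    """Evolve a weight vector W; returns (best_weights, best_fitness_per_generation)."""
    rng = random.Random(seed)
    n = len(rows[0])
    # 1. Initial population of random weight vectors.
    population = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # 2. Fitness evaluation: order individuals from best to worst.
        population.sort(key=lambda w: fitness(w, rows, targets))
        history.append(fitness(population[0], rows, targets))
        # 3. Selection: the fitter half survives unchanged (elitism).
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # 4. Crossover: single cut point between two parents.
            cut = rng.randint(1, n - 1) if n > 1 else 0
            child = a[:cut] + b[cut:]
            # 5. Mutation: small Gaussian perturbation of some genes.
            child = [w + rng.gauss(0, 0.1) if rng.random() < mutation_rate else w
                     for w in child]
            children.append(child)
        population = parents + children
    best = min(population, key=lambda w: fitness(w, rows, targets))
    history.append(fitness(best, rows, targets))
    return best, history


if __name__ == "__main__":
    # Hypothetical toy data: three code metrics per class and a target value.
    rows = [[1.0, 2.0, 0.5], [0.5, 1.0, 2.0], [2.0, 0.0, 1.0], [1.5, 1.5, 1.5]]
    targets = [0.5, 2.0, 2.0, 1.875]
    best, history = evolve(rows, targets)
    print("first and last best error:", history[0], history[-1])
```

Because the fitter half of each generation is carried over unchanged, the best error in this sketch never increases from one generation to the next; it does not guarantee that the global optimum is reached.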

Results and discussion
The proposed methodology for software maintainability prediction is implemented in R. The developed model aims to reduce the error rate. Six popular techniques, namely SWR, SVM, NN, MARS, GA, and CART, are considered for software maintainability prediction, and their performance is evaluated based on RMSE, MAE, and $R^2$.
Performance analysis. A critical step in any empirical study is determining the prediction model's accuracy. The model predicts the value of the dependent variable, which is then compared to the actual value to discover errors. The current work compares various popular ML, statistical (ST), and metaheuristic techniques using the following measures.
The mean absolute error (MAE) 8 in Eq. (7) is a standardized measure of the difference between the actual and anticipated values of a dependent variable. For each data point, the difference between the actual and anticipated values is divided by the actual value, and the absolute value of the result is taken. These absolute values are then summed and divided by the total number of data points.
For RMSE 8, given in Eq. (8), the difference between the predicted and actual values for each class is squared, the squares are averaged, and the square root of the average is taken.
In a regression model, $R^2$ 8, given in Eq. (9), is the proportion of the variation in the dependent variable that is explained by the independent variable or variables. The $R^2$ value indicates how much of the variance of one variable is explained by the variance of the other.
In regression analysis, the coefficient of determination ($R^2$), mean absolute error (MAE), and root mean squared error (RMSE) are used to assess a model's performance. The lower the MAE and RMSE and the higher the $R^2$, the more accurate a regression model is considered to be. Compared to the MAE, the mean squared error (MSE) and RMSE penalize large prediction errors more heavily. However, because RMSE has the same units as the dependent variable (Y-axis), it is more commonly used than MSE to evaluate the efficiency of a regression model in comparison with other models.
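The three measures can be sketched directly from the verbal descriptions above. Eqs. (7)-(9) are not reproduced in this text, so the exact forms are assumptions: the first function divides each error by the actual value, as the description states, which makes it a relative variant of the usual MAE, and $R^2$ is computed as one minus the ratio of residual to total sum of squares. The function names and sample values are hypothetical.

```python
import math


def mae_relative(actual, predicted):
    """Mean of |actual - predicted| / |actual|, following the description in the text."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the mean squared difference between predicted and actual values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total (assumed form)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical actual and predicted maintainability values for three classes.
    actual = [10.0, 20.0, 30.0]
    predicted = [12.0, 18.0, 33.0]
    print(mae_relative(actual, predicted))
    print(rmse(actual, predicted))
    print(r_squared(actual, predicted))
```

The relative form is undefined when an actual value is zero, and $R^2$ is undefined when all actual values are equal; both cases would need explicit handling on real datasets.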
The UIMS dataset is private. Its source code is developed entirely in Java, and it contains 39 instances and 11 features. The genetic algorithm achieved better performance than the other models.
The QUES dataset is private. Its source code is developed entirely in Java, and it contains 71 instances and 11 features. Neural networks achieved better performance than the other models.
The CM1 dataset is public. Its source code is developed entirely in C, and it contains 505 instances and 40 features. Neural networks achieved better performance than the other models.
The JM1 dataset is public. Its source code is developed entirely in C, and it contains 10,878 instances and 21 features. Neural networks 26 achieved better performance than the other models.
The KC1 dataset is public. Its source code is developed entirely in C++, and it contains 2107 instances and 21 features. The genetic algorithm achieved better performance than the other models.
The KC3 dataset is public. Its source code is developed entirely in Java, and it contains 458 instances and 40 features. MARS achieved better performance than the other models.
The MC1 dataset is public. Its source code is developed entirely in C and C++, and it contains 9466 instances and 39 features. The genetic algorithm achieved better performance than the other models.
The MC2 dataset is public. Its source code is developed entirely in C, and it contains 161 instances and 40 features. MARS achieved better performance than the other models.
The MW1 dataset is public. Its source code is developed entirely in C, and it contains 403 instances and 40 features. The genetic algorithm achieved better performance than the other models.
The PC1 dataset is public. Its source code is developed entirely in C, and it contains 1107 instances and 40 features. MARS achieved better performance than the other models.
The PC2 dataset is public. Its source code is developed entirely in C, and it contains 5589 instances and 40 features. The genetic algorithm achieved better performance than the other models.
The PC3 dataset is public. Its source code is developed entirely in C, and it contains 1563 instances and 40 features. MARS achieved better performance than the other models.
The PC4 dataset is public. Its source code is developed entirely in C, and it contains 1458 instances and 40 features. The genetic algorithm achieved better performance than the other models.
The PC5 dataset is public. Its source code is developed entirely in C, and it contains 17,186 instances and 39 features. The genetic algorithm achieved better performance than the other models.
The results for all the performance measures (MAE, RSQ, and RMSE) on the various datasets are presented in Figs. 4-17. For the UIMS, QUES, CM1, and JM1 datasets, the best algorithm for predicting maintainability is NN (RMSE(UIMS) = 8.57%, RMSE(QUES) = 0.7%, ...). These results reveal that GA is the best algorithm for 36% of the datasets, followed by NN 29 and MARS (29%). The worst algorithms for predicting maintainability are CART (1%), stepwise regression, and SVM. The results show that the best models in SMP are GA, NN, and MARS. These SMP models outperformed the other models on heterogeneous datasets. GA also outperformed the other algorithms on datasets with a large number of instances and features. It is interesting to note that the performance of GA decreased on some datasets that have a large number of features but a small number of instances. It is understood that GA works best when the search space is large and a large number of parameters are involved. However, the applicability of GA over all the heterogeneous datasets is still questionable 30 and still creates uncertainty in SMP model selection. To overcome this issue, we used an MCDA approach to establish the superiority of GA over all the heterogeneous datasets.
Technique for order preference by similarity to ideal solution (TOPSIS). An MCDA model, TOPSIS 28, was applied to rank the algorithms on the different datasets. It computes the distances between a real solution and its ideal and negative ideal counterparts 31. We make use of the following algorithm.
Step 1 Calculate the normalized decision matrix using Eq. (10).
Step 2 Calculate the weighted normalized decision matrix using Eq. (11), $v_{ij} = w_i r_{ij}$, where $w_i$ represents the weight.
Step 3 Find the ideal solution $S^+$ and the negative ideal solution $S^-$ using Eqs. (12) and (13), where I′ and I′′ represent the benefit criteria and cost criteria, respectively. Stability, TPR, TNR, accuracy and AUC are benefit criteria. Runway.
Step 4 Calculate the Euclidean distance between the real and ideal solutions using Eqs. (14) and (15).
Step 5 Calculate the relative closeness $R_j^+$ of each alternative to the ideal solution.
Step 6 Rank the alternatives (here, the SMP models) by maximizing $R_j^+$, Eq. (16).
To further remove the uncertainty in SMP models, the TOPSIS method, a popular multi-criteria decision-making technique, was applied to rank the algorithms on the different datasets 32. TOPSIS provides trade-offs between criteria, allowing good performance on one criterion to offset a poor result on another. Rather than including or excluding alternative solutions based on strict cut-offs, as non-compensatory approaches do, this offers a more realistic modelling approach when compared to Multi-objective Optimization on the basis of Ratio Analysis (MOORA) and A New Additive Ratio Assessment method (ARAS).
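The TOPSIS procedure outlined above can be sketched in a few lines. Eqs. (10) and (12)-(16) are not reproduced in this text, so the sketch assumes the textbook formulation: vector normalization, Euclidean distances, and relative closeness computed as the distance to the negative ideal divided by the sum of both distances. The criteria (MAE, RMSE, $R^2$), the weights, and the scores are hypothetical and are not the values used in this study.

```python
import math


def topsis(matrix, weights, benefit):
    """Rank alternatives; matrix[j][i] is the score of alternative j on criterion i.

    benefit[i] is True for a benefit criterion and False for a cost criterion.
    Returns (closeness, order), where order lists alternative indices best first.
    """
    n_crit = len(weights)
    # Step 1: vector normalization of each criterion column (assumed form).
    norms = [math.sqrt(sum(row[i] ** 2 for row in matrix)) for i in range(n_crit)]
    # Step 2: weighted normalized matrix, v_ij = w_i * r_ij.
    v = [[weights[i] * row[i] / norms[i] for i in range(n_crit)] for row in matrix]
    # Step 3: ideal and negative ideal solutions per criterion.
    columns = list(zip(*v))
    ideal = [max(c) if benefit[i] else min(c) for i, c in enumerate(columns)]
    negative = [min(c) if benefit[i] else max(c) for i, c in enumerate(columns)]
    # Step 4: Euclidean distance of each alternative to both reference points.
    d_plus = [math.dist(row, ideal) for row in v]
    d_minus = [math.dist(row, negative) for row in v]
    # Step 5: relative closeness to the ideal solution (1 is best, 0 is worst).
    closeness = [dm / (dp + dm) for dp, dm in zip(d_plus, d_minus)]
    # Step 6: rank alternatives by decreasing closeness.
    order = sorted(range(len(matrix)), key=lambda j: closeness[j], reverse=True)
    return closeness, order


if __name__ == "__main__":
    # Hypothetical scores of three models on (MAE, RMSE, R^2).
    models = ["model A", "model B", "model C"]
    scores = [[0.10, 2.0, 0.95], [0.20, 3.0, 0.90], [0.30, 4.0, 0.80]]
    closeness, order = topsis(scores, weights=[1 / 3, 1 / 3, 1 / 3],
                              benefit=[False, False, True])
    print([models[j] for j in order])
```

In this toy input the first model is best on every criterion, so it coincides with the ideal solution and receives a closeness of 1. The sketch assumes that no criterion column is all zeros and that the alternatives are not all identical, since either case would lead to a division by zero.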
The rankings of the algorithms on the different datasets are presented in Table 3. From Table 3, it is observed that GA is the best model for predicting the maintainability of heterogeneous software. Further, it is observed that the correlation between TOPSIS and the other two MCDA methods (MOORA and ARAS) used in this study was strong 33. Based on the performance analysis and MCDA ranking, it is understood that GA is an optimum model for predicting the maintainability of heterogeneous software.
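The text does not state which correlation coefficient was used to compare the MCDA rankings. Assuming a rank correlation such as Spearman's rho, the agreement between two rankings could be checked as sketched below; the rank values are hypothetical and are not taken from Table 3.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two rankings without ties."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    # Hypothetical ranks of six SMP models under two MCDA methods.
    topsis_ranks = [1, 2, 3, 4, 5, 6]
    moora_ranks = [1, 3, 2, 4, 5, 6]
    print(spearman_rho(topsis_ranks, moora_ranks))
```

A value close to 1 indicates that the two methods order the models almost identically. The closed-form expression used here is only valid when neither ranking contains ties.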

Threats to validity
Limitations encountered during this study are listed below:
1. The results obtained in this study are based on the NASA and Li and Henry datasets, which were developed using C, C++, and Java. This study's model is applicable to those paradigms only. Further research on other languages can be conducted to improve usability.
2. The code metrics considered in this study can be expanded with new code metrics that influence software code and design. These can play a significant role in predicting maintainability.
3. The NASA datasets are from automated satellite applications, but the use of real-time advanced automated application data would improve the SMP's reliability 34.
4. The explainability of the prediction results remains a concern.
5. The GA algorithm takes a large amount of processing time and computational resources to produce the prediction results compared to other SMP models.
Table 3. Comparative analysis of multiple criteria decision analysis on SMP models.


Conclusion
This paper focused on removing the ambiguity in selecting models for predicting the maintainability of heterogeneous software. To this end, various popular publicly available datasets of heterogeneous applications were considered, and maintainability was predicted using GA and compared against five popular techniques, namely step-wise regression, support vector machine, neural networks, multivariate adaptive regression splines, and classification and regression tree. To choose the optimum model for predicting the maintainability of heterogeneous software, a multiple criteria decision-making model, TOPSIS, was used. The overall analysis has shown the efficiency of the proposed model over other popular maintainability prediction models. A range of possible future works has been identified during this research. There is a need to adopt real-time heterogeneous data for predicting maintainability. In addition, other techniques and code metrics should be adopted to enhance the estimation of maintainability.

Data availability
The UIMS and QUES datasets analyzed during the current study are available in Li et al. 19